Abstract

We present in this work a new methodology to design kernels on data 
which is structured with smaller components, such as text, images or se- 
quences. This methodology is a template procedure which can be applied 
on most kernels on measures and takes advantage of a more detailed "bag 
of components" representation of the objects. To obtain such a detailed 
description, we consider possible decompositions of the original bag into a 
collection of nested bags, following a prior knowledge on the objects' struc- 
ture. We then consider these smaller bags to compare two objects both in 
a detailed perspective, stressing local matches between the smaller bags, 
and in a global or coarse perspective, by considering the entire bag. This 
multiresolution approach is likely to be best suited for tasks where the 
coarse approach is not precise enough, and where a more subtle mixture 
of both local and global similarities is necessary to compare objects. The 
approach presented here would not be computationally tractable without 
a factorization trick that we introduce before presenting promising results 
on an image retrieval task. 



1 Introduction 

There is strong evidence that kernel methods can deliver state-of-the-art 
performance on most classification tasks when the input data lies in a vector 
space. Arguably, two factors contribute to this success: first, the ability
of kernel algorithms, such as the SVM, to generalize well and provide a sparse
formulation for the underlying learning problem; second, the capacity of
non-linear kernels, such as the polynomial and RBF kernels, to quantify meaningful
similarities between vectors, notably non-linear correlations between their
components. Using kernel machines with non-vectorial data (e.g., in bioinformatics,
pattern recognition or signal processing tasks) requires more arbitrary choices,
both to represent the objects and to choose suitable kernels on those
representations. The challenge of using kernel methods on real-world data has thus
recently fostered many proposals for kernels on complex objects, notably for 
strings, trees, images or graphs to cite a few. 

A strategy often quoted as the generative approach to this problem takes 
advantage of a generative model, that is an adequate statistical model for the 
objects, to derive feature representations for the objects. In practice this often 
yields kernels to be used on the histograms of smaller components sampled in 
the objects, where the kernels take into account the geometry of the underlying 
model in their similarity measures [?]. The previous approaches, coupled
with SVMs, combine the advantages of discriminative and generative
methods, and have produced convincing results on many tasks.

One of the drawbacks of such representations is however that they implicitly 
assume that each component has been generated independently and in a sta- 
tionary way, where the empirical histogram of components is seen as a sample 
from an underlying stationary measure. While this viewpoint may translate 
into adequate properties for some learning tasks (such as translation or rotation 
invariance when using histograms of colors to manipulate images [3]), it might 
prove too restrictive and hence inadequate for other types of problems. Namely, 
tasks which involve a more subtle mix of detecting both conditional (with re- 
spect to the location of the components for instance) and global similarities 
between the objects. Such problems are likely to arise for instance in speech,
language, time series or image processing. In the first three tasks, this
consideration is notably treated by most state-of-the-art methods through dynamic
programming algorithms capable of detecting and penalizing accordingly local
matches between the objects. Using dynamic programming to produce a kernel
yielded fruitful results in different applications [14, 12], with the limitation that
the kernels obtained in practice are not always positive definite, as reviewed 
in [?]. Other kernels proposed for sequences directly incorporate localization
information into each component, considerably enlarging
the component space, and then introduce some smoothing (such as mismatches)
to avoid representations that would be too sparse.

We propose in this work a different approach grounded on the generative
approach previously quoted, which nevertheless manages to combine both conditional and
global similarities when comparing two objects. The motivation behind this
approach is both intuitive and computational. Intuitively, the global histogram
of components, that is the simple bag of components representation of Figure 1,
may seem inadequate if the components' appearance seems to be clearly
conditioned by some external events. This phenomenon can be taken into account
by considering collections (indexed on the same set of events, to be defined)
of nested bags or histograms to describe the object. Kernels that would only
rely on these detailed resolutions might however miss the bigger picture that is
provided by the global histogram. We propose a trade-off between both
viewpoints through a combination that aims at giving a balanced account of both
fine and coarse perspectives, hence the name of multiresolution kernels, which
we introduce formally in Section 2. On the computational side, we show how
such a theoretical framework can translate into an efficient factorization detailed
in Section 3. We then provide experimental results in Section 4 on an image
retrieval task, which show that the methodology improves the performance of
kernel-based state-of-the-art techniques in this field.

Figure 1: From the bag of components representation to a set of nested bags,
using a set of conditioning events.

2 Multiresolution Kernels 

In most applications, complex objects can be represented as histograms of com- 
ponents, such as texts as bags of words or images and sequences as histograms 
of colors and letters. Through this representation, objects are cast as
probability laws or measures on the space $\mathcal{X}$ of components, typically multinomials
if $\mathcal{X}$ is finite [?], and compared as such through kernels on measures.
An obvious drawback of this representation is that all contextual information 
on how the components have been sampled is lost, notably any general sense 
of position in the objects, but also more complex conditional information that 
may be induced from neighboring components, such as transitions or long range 
interactions. 

In the case of images for instance, one may be tempted to consider not 
only the overall histogram of colors, but also more specialized histograms which 
may be relevant for the task. If some local color-overlapping in the images 
is an interesting or decisive feature of the learning problem, these specialized 
histograms may be generated arbitrarily following a grid, dividing for instance 
the image into 4 equal parts, and computing histograms for each corner before 
comparing them pairwise between two images (see Figure 2 for an illustration).
If sequences are at stake, these may also be sliced into predefined regions to
yield local histograms of letters. If the strings are on the contrary assumed
to follow some Markovian behaviour (namely that the appearance of letters in
the string is independent of their exact location and only depends on the few
letters that precede them), an interesting index would translate into a set of
contexts, typically a complete suffix dictionary as detailed in [?]. While the two
previous examples may seem opposed in the way the histograms are generated,
both methodologies stress a particular class of events (location or transitions)
that gives additional knowledge on how the components were sampled in
the objects. Since both approaches, and possibly other ones, can be
applied within the framework of this paper using a unified formalism, we present
our methodology using a general notation for the index of events. Namely, we
write $\mathcal{T}$ for an arbitrary set of conditioning events, assuming these events can
be directly observed on the object itself, by contrast with the latent variables
approach of [?]. Considering still, following the generative approach, that
an object can be mapped onto a probability measure $\mu$ on $\mathcal{X}$, the
realization of an event $t \in \mathcal{T}$ can be interpreted in the light of a joint
probability $p(x,t)$, with $x \in \mathcal{X}$, factorized through Bayes' law as $p(x|t)\,p(t)$ to
yield the following decomposition of $\mu$ as

$$\mu = \sum_{t \in \mathcal{T}} \mu_t,$$

where each $\mu_t := p(\cdot|t)\,p(t)$ is an element of the set of sub-probability measures
$M_+^s(\mathcal{X})$, that is the set of positive measures $\rho$ on $\mathcal{X}$ such that their total mass
$\rho(\mathcal{X})$, denoted as $|\rho|$, is less than or equal to 1. To take into account the
information brought by the events in $\mathcal{T}$, objects can hence be represented as
families of measures of $M_+^s(\mathcal{X})$ indexed by $\mathcal{T}$, namely elements $\mu$ contained in

$$M_{\mathcal{T}}(\mathcal{X}) := M_+^s(\mathcal{X})^{\mathcal{T}}.$$
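
The decomposition above can be illustrated with a minimal Python sketch, assuming that the events are the cells of a regular grid laid over an image and that the components are color indices. The function name `nested_histograms` and the toy image are hypothetical and only serve the illustration.

```python
from collections import Counter

def nested_histograms(pixels, n_rows, n_cols):
    """Split a 2-D grid of color indices into n_rows x n_cols windows.

    Returns {window t: {color: mass}} where each mass is a pixel count
    divided by the total number of pixels, i.e. mu_t = p(.|t) p(t).
    """
    height, width = len(pixels), len(pixels[0])
    total = height * width
    counts = {}
    for i, row in enumerate(pixels):
        for j, color in enumerate(row):
            # Index of the window (event) that contains pixel (i, j).
            t = (i * n_rows // height, j * n_cols // width)
            counts.setdefault(t, Counter())[color] += 1
    return {t: {c: n / total for c, n in cnt.items()} for t, cnt in counts.items()}

# Toy two-color image (hypothetical data).
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 1, 1, 1],
    [1, 1, 1, 0],
]
mu = nested_histograms(image, 2, 2)

# Summing the sub-probability measures mu_t recovers the global histogram mu.
global_histogram = Counter()
for mu_t in mu.values():
    global_histogram.update(mu_t)
```

Each `mu[t]` has total mass p(t) (here 1/4 per window), and the masses of all windows add up to 1.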

2.1 Local Similarities Between Measures Conditioned by 
Sets of Events 

To compare two objects in the light of their respective decompositions as
sub-probability measures $\mu_t$ and $\mu'_t$, we make use of an arbitrary positive definite
kernel $k$ on $M_+^s(\mathcal{X})$, which we will refer to as the base kernel throughout the
paper. For interpretation purposes only, we may assume in the following sections
that $k$ can be written as $e^{-d^2}$ where $d$ is a Euclidean distance in $M_+^s(\mathcal{X})$. Note
also that the kernel is defined not only on probability measures, but also on
sub-probabilities. For two elements $\mu, \mu'$ of $M_{\mathcal{T}}(\mathcal{X})$ and a given element $t \in \mathcal{T}$,
the kernel

$$k_t(\mu,\mu') := k(\mu_t, \mu'_t)$$

measures the similarity of $\mu$ and $\mu'$ by quantifying how similarly their
components were generated conditionally to event $t$. For two different events $s$ and $t$ of
$\mathcal{T}$, $k_s$ and $k_t$ can be associated through polynomial combinations with positive
factors to result in new kernels, notably their sum $k_s + k_t$ or their product $k_s k_t$.
This is particularly adequate if some complementarity is assumed between $s$
and $t$, so that their combination can provide new insights for a given learning
task. If on the contrary the events are assumed to be similar, then they can be
regarded as a unique event $\{s\} \cup \{t\}$ and result in the kernel

$$k_{\{s\}\cup\{t\}}(\mu,\mu') := k(\mu_s + \mu_t,\ \mu'_s + \mu'_t),$$

which will measure the similarity of $\mu$ and $\mu'$ when either $s$ or $t$ occurs. The
previous formula can be extended to model kernels indexed on a set $T \subset \mathcal{T}$ of
similar events, through

$$k_T(\mu,\mu') := k(\mu_T, \mu'_T), \quad \text{where } \mu_T := \sum_{t \in T} \mu_t \text{ and } \mu'_T := \sum_{t \in T} \mu'_t.$$

Note that this is equivalent to defining a distance between elements $\mu$ and $\mu'$
conditioned by $T$ as $d_T^2(\mu,\mu') := d^2(\mu_T, \mu'_T)$.
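
As a hedged illustration of kernels conditioned by a set of events, the sketch below merges the sub-probability measures of the events in a set and applies a base kernel to the merged measures. The base kernel is assumed to be the exponential of minus the squared Euclidean distance between mass vectors; the names `merge`, `base_kernel` and `k_T`, and the toy measures, are hypothetical.

```python
import math

def merge(mu, events):
    """Return mu_T, the sum of the sub-probability measures mu_t for t in events."""
    merged = {}
    for t in events:
        for x, mass in mu.get(t, {}).items():
            merged[x] = merged.get(x, 0.0) + mass
    return merged

def base_kernel(m1, m2):
    """Assumed base kernel exp(-d^2), d being the Euclidean distance between mass vectors."""
    d2 = sum((m1.get(x, 0.0) - m2.get(x, 0.0)) ** 2 for x in set(m1) | set(m2))
    return math.exp(-d2)

def k_T(mu, mu_prime, events):
    """Kernel conditioned by a set of events: base kernel on the merged measures."""
    return base_kernel(merge(mu, events), merge(mu_prime, events))

# Two toy objects whose components are swapped between events s and t.
mu = {"s": {"a": 0.5}, "t": {"b": 0.5}}
mu_prime = {"s": {"b": 0.5}, "t": {"a": 0.5}}

k_s = k_T(mu, mu_prime, ["s"])            # compares mu_s with mu'_s only
k_merged = k_T(mu, mu_prime, ["s", "t"])  # compares mu_s + mu_t with mu'_s + mu'_t
```

In this toy case the two objects differ under event s alone but become identical once s and t are merged, so `k_merged` reaches the maximal value 1 while `k_s` does not.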

2.2 Resolution Specific Kernels 

Let $P$ be a finite partition of $\mathcal{T}$, that is a finite family $P = (T_1, \ldots, T_n)$ of subsets of
$\mathcal{T}$, such that $T_i \cap T_j = \emptyset$ if $1 \le i < j \le n$ and $\bigcup_{i=1}^{n} T_i = \mathcal{T}$. We write $\mathcal{P}(\mathcal{T})$ for
the set of all partitions of $\mathcal{T}$. Consider now the kernel defined by a partition $P$
as

$$k_P(\mu,\mu') := \prod_{i=1}^{n} k_{T_i}(\mu,\mu'). \qquad (1)$$

The kernel $k_P$ quantifies the similarity between two objects by detecting their
joint similarity under all possible events of $\mathcal{T}$, given an a priori similarity
assumed on the events which is expressed as a partition of $\mathcal{T}$. Note that there
is some arbitrariness in this definition since, following the convolution kernels [5]
approach for instance, a simple multiplication of base kernels $k_{T_i}$ to define $k_P$ is
used, rather than any other polynomial combination. More precisely, the
multiplicative structure of Equation (1) quantifies how two objects are similar given
a partition $P$ in a way that requires the objects to be similar according to
all subsets $T_i$. If $k$ can be expressed through a distance $d$ as $k = e^{-d^2}$, $k_P$ can be
expressed as the exponential of

$$-\sum_{i=1}^{n} d_{T_i}^2(\mu,\mu'),$$

a quantity which penalizes local differences between the decompositions of $\mu$
and $\mu'$ over $\mathcal{T}$, as opposed to the coarsest approach where $P = \{\mathcal{T}\}$ and only
$d^2(\mu,\mu')$ is considered.
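
The contrast between a coarse and a fine partition can be sketched as follows, again assuming a base kernel of the form exp(-d^2) with d the Euclidean distance between mass vectors. The function names and the two toy images are hypothetical.

```python
import math

def squared_distance(mu, mu_prime, events):
    """d_T^2: squared Euclidean distance between the measures merged over `events`."""
    diff = {}
    for t in events:
        for x, mass in mu.get(t, {}).items():
            diff[x] = diff.get(x, 0.0) + mass
        for x, mass in mu_prime.get(t, {}).items():
            diff[x] = diff.get(x, 0.0) - mass
    return sum(v * v for v in diff.values())

def resolution_kernel(mu, mu_prime, partition):
    """k_P: product over the sets T_i of the partition of exp(-d_{T_i}^2)."""
    value = 1.0
    for events in partition:
        value *= math.exp(-squared_distance(mu, mu_prime, events))
    return value

# Two toy images with the same global histogram but different layouts.
mu = {"nw": {"black": 0.25}, "ne": {"white": 0.25},
      "sw": {"black": 0.25}, "se": {"white": 0.25}}
mu_prime = {"nw": {"white": 0.25}, "ne": {"black": 0.25},
            "sw": {"black": 0.25}, "se": {"white": 0.25}}

coarse = resolution_kernel(mu, mu_prime, [["nw", "ne", "sw", "se"]])
fine = resolution_kernel(mu, mu_prime, [["nw"], ["ne"], ["sw"], ["se"]])
```

The coarse partition sees two identical global histograms, whereas the fine partition penalizes the two windows whose colors were swapped.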

As illustrated in Figure 2 in the case of images expressed as histograms
indexed over locations, a partition of $\mathcal{T}$ reflects a given belief on how events should
be associated to belong to the same set or dissociated to highlight interesting
dissimilarities. Hence, not all partitions contained in the set $\mathcal{P}(\mathcal{T})$ of all possible
partitions¹ are likely to be equally meaningful, given that some events may
look more similar than others. If the index is based on location, one would
naturally favor mergers between neighboring indexes. For contexts, a useful
topology might also be derived by grouping contexts with similar suffixes.

Figure 2: A useful set of events $\mathcal{T}$ for images which would focus on pixel
localization can be represented by a grid, such as the 8×8 one represented above.
In this case $P_3$ corresponds to the $4^3 = 64$ windows presented in the left image, $P_2$ to
the 16 larger squares obtained when grouping 4 small windows, $P_1$ to the image
divided into 4 equal parts and $P_0$ is simply the whole image. Any partition
of the image obtained from sets in $P_0^3 = P_0 \cup P_1 \cup P_2 \cup P_3$, such as the one represented above, can
in turn be used to represent an image as a family of sub-probability measures,
which reduces in the case of two-color images to binary histograms as illustrated
in the right-most image.

Such meaningful partitions can be obtained in a general case if we assume
the existence of a prior hierarchical information on the elements of $\mathcal{T}$, translated
into a series

$$P_0 = \{\mathcal{T}\},\ \ldots,\ P_D = \{\{t\},\ t \in \mathcal{T}\}$$

of partitions of $\mathcal{T}$, namely a hierarchy on $\mathcal{T}$. To provide a hierarchical content,
the family $(P_d)_{d=0}^{D}$ is such that any subset present in a partition $P_d$ is included
in a (unique by definition of a partition) subset of the coarser partition
$P_{d-1}$, and we further assume this inclusion to be strict. This is equivalent to stating
that each set $T$ of a partition $P_d$ is divided in $P_{d+1}$ through a partition of $T$
which is not $\{T\}$ itself. We write $s(T)$ for this partition and name its elements the
siblings of $T$. Consider now the subset $\mathcal{P}_D \subset \mathcal{P}(\mathcal{T})$ of all partitions of $\mathcal{T}$ obtained
by using only sets in

$$P_0^D := \bigcup_{d=0}^{D} P_d,$$

namely $\mathcal{P}_D := \{P \in \mathcal{P}(\mathcal{T}) \text{ s.t. } \forall T \in P,\ T \in P_0^D\}$. The set $\mathcal{P}_D$ contains both
the coarsest and the finest resolutions, respectively $P_0$ and $P_D$, but also all
variable resolutions for sets enumerated in $P_0^D$, as can be seen for instance in
the third image of Figure 2.

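¹ which is quite a big space, since if $\mathcal{T}$ is a finite set of cardinality $r$, the cardinality of the set of
partitions is known as the Bell number of order $r$, with $B_r = \frac{1}{e}\sum_{u=1}^{\infty} \frac{u^r}{u!}$, which grows like $e^{r \ln r}$ in the sense that $\ln B_r \sim r \ln r$.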
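
To give a rough sense of the sizes involved, the sketch below compares the number of all partitions of an index set (the Bell number) with the number of partitions that can be built from a regular hierarchy. The recurrence used in `hierarchy_partitions` is not stated in the text: it is an assumption derived here from the observation that a set is either kept as is or replaced by its siblings, each of which is then partitioned independently. Both function names are hypothetical.

```python
def bell_number(r):
    """Bell number B_r (number of partitions of a set of size r), via the Bell triangle."""
    row = [1]
    for _ in range(r):
        new_row = [row[-1]]
        for value in row:
            new_row.append(new_row[-1] + value)
        row = new_row
    return row[0]

def hierarchy_partitions(alpha, depth):
    """Number of partitions built from a regular hierarchy of the given depth,
    assuming every non-terminal set has exactly `alpha` siblings."""
    count = 1  # depth 0: the only partition is the whole index set
    for _ in range(depth):
        # Either keep the root set, or split it and partition each sibling independently.
        count = 1 + count ** alpha
    return count

# 8x8 grid with a quadtree hierarchy (alpha = 4, D = 3): 64 elementary events.
restricted = hierarchy_partitions(4, 3)
unrestricted = bell_number(64)
```

Under these assumptions the hierarchy keeps only 83,522 partitions of the 64 grid cells, a tiny fraction of the Bell number of order 64.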
2.3 Averaging Resolution Specific Kernels

Each partition $P$ contained in $\mathcal{P}_D$ provides a resolution to compare two objects,
and consequently generates a very large family of kernels $k_P$ when $P$ spans $\mathcal{P}_D$.
Some partitions are probably better suited for certain tasks than others, which
may call for an efficient estimation of an optimal partition given a task. We take
in this section a different direction by considering an averaging of such kernels
based on a Bayesian prior on the set of partitions. In practice, this averaging
favours objects which share similarities under a large collection of resolutions.

Definition 1. Let $\mathcal{T}$ be an index set endowed with a hierarchy $(P_d)_{d=0}^{D}$, $\pi$ be
a prior measure on the corresponding set of partitions $\mathcal{P}_D$ and $k$ a base kernel
on $M_+^s(\mathcal{X}) \times M_+^s(\mathcal{X})$. The multiresolution kernel $k_\pi$ on $M_{\mathcal{T}}(\mathcal{X}) \times M_{\mathcal{T}}(\mathcal{X})$ is
defined as

$$k_\pi(\mu,\mu') = \sum_{P \in \mathcal{P}_D} \pi(P)\, k_P(\mu,\mu'). \qquad (2)$$

Note that in Equation (2), each resolution specific kernel contributes to the
final kernel value and may be regarded as a weighted feature extractor.

3 Kernel Computation 

This section aims at characterizing hierarchies $(P_d)_{d=0}^{D}$ and priors $\pi$ for which
the computation of $k_\pi$ is both tractable and meaningful. We first propose a
type of hierarchy generated by trees, which is then coupled with a branching
process prior to fully specify $\pi$. These settings yield a computational time for
expressing $k_\pi$ which is loosely upper-bounded by $D \times \operatorname{card}\mathcal{T} \times c(k)$, where $c(k)$
is the time required to compute the base kernel.

3.1 Partitions Generated by Branching Processes 

All partitions $P$ of $\mathcal{P}_D$ can be generated iteratively through the following rule,
starting from the initial root partition $P := P_0 = \{\mathcal{T}\}$. For each set $T$ of $P$:

1. either leave the set as it is in $P$,

2. or replace it by its siblings enumerated in $s(T)$, and reapply this rule
to each sibling unless they belong to the finest partition $P_D$.

By giving a probabilistic content to the previous rule through a binomial
parameter (i.e. for each treated set assign probability $1 - \varepsilon$ of applying rule 1
and probability $\varepsilon$ of applying rule 2) a candidate prior for $\mathcal{P}_D$ can be derived,
depending on the overall coarseness of the considered partition. For all elements
$T$ of $P_D$ this binomial parameter is equal to 0, whereas it can be individually
defined for any element $T$ of the coarser partitions $P_0, \ldots, P_{D-1}$ as $\varepsilon_T \in [0,1]$, yielding
for a partition $P \in \mathcal{P}_D$ the weight

$$\pi(P) = \prod_{T \in P} (1 - \varepsilon_T) \prod_{T \in \mathring{P}} \varepsilon_T,$$

where the set $\mathring{P} = \{T \in P_0^D \text{ s.t. } \exists V \in P,\ V \subsetneq T\}$ gathers all coarser sets
belonging to coarser resolutions than $P$, and can be regarded as all ancestors in
$P_0^D$ of sets enumerated in $P$.
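
The branching rule and the resulting weights can be checked on a small example. The sketch below enumerates every partition generated by a toy two-level hierarchy together with its prior weight. The tree encoding as nested `(name, siblings)` tuples, the function name `weighted_partitions` and the chosen values of the branching parameters are all hypothetical.

```python
from itertools import product

def weighted_partitions(node, eps):
    """Enumerate (weight, partition) pairs for the hierarchy rooted at `node`.

    `node` is (name, siblings); `eps` maps the name of each non-terminal set T
    to its branching probability eps_T. Terminal sets are never split.
    """
    name, siblings = node
    if not siblings:
        return [(1.0, [name])]
    results = [(1.0 - eps[name], [name])]            # rule 1: keep T as it is
    per_sibling = [weighted_partitions(s, eps) for s in siblings]
    for combo in product(*per_sibling):              # rule 2: replace T by s(T)
        weight, sets = eps[name], []
        for w, part in combo:
            weight *= w
            sets = sets + part
        results.append((weight, sets))
    return results

# Toy hierarchy: T = {a, b, c, d}, split into L = {a, b} and R = {c, d}.
tree = ("T", [("L", [("a", []), ("b", [])]),
              ("R", [("c", []), ("d", [])])])
eps = {"T": 0.5, "L": 0.25, "R": 0.5}

parts = weighted_partitions(tree, eps)
weights = {tuple(p): w for w, p in parts}
```

For instance the partition (L, R) receives the weight eps_T (1 - eps_L)(1 - eps_R) = 0.1875, and the five weights add up to 1, as expected from a probability on the set of partitions.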

3.2 Factorization 

The prior proposed in Section 3.1 can be used to factorize the formula in (2),
which is summarized in this theorem, using the notations of Definition 1.

Theorem 1. For two elements $\mu, \mu'$ of $M_{\mathcal{T}}(\mathcal{X})$, define for $T$ spanning
recursively $P_D, P_{D-1}, \ldots, P_0$ the quantity

$$K_T = (1 - \varepsilon_T)\, k_T(\mu,\mu') + \varepsilon_T \prod_{U \in s(T)} K_U.$$

Then $k_\pi(\mu,\mu') = K_{\mathcal{T}}$.

Proof. The proof follows from the prior structure used for the tree generation,
and can be found in either [?] or [?]. Figure 3 underlines the importance of
incorporating to each node $K_T$ a weighted product of the kernels $K_U$ computed
by its siblings. □




Figure 3: The update rule for the computation of $k_\pi$ takes into account the
branching process prior by updating each node corresponding to a set $T$ of any
intermediary partition with the values obtained for higher resolutions in $s(T)$.

If the hierarchy of $\mathcal{T}$ is such that the cardinality of $s(T)$ is fixed to a constant
$\alpha$ for any set $T$, typically $\alpha = 4$ for images as seen in Figure 2, then the
computation of $k_\pi$ is upper-bounded by $(\alpha^{D+1} - 1)\,c(k)$. This computational
complexity may even become lower in cases where the histograms become sparse
at fine resolutions, yielding complexities linear in the size
of the compared objects, quantified by the length of the sequences in [?] for
instance.
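
The factorization can be sketched and checked numerically on a toy hierarchy. The code below computes the kernel with the recursive update of Theorem 1 and compares it with a brute-force average over all partitions. It assumes a uniform branching parameter, a base kernel of the form exp(-d^2) with d the Euclidean distance between mass vectors, and a tree encoded as nested `(events, siblings)` tuples; all names and data are hypothetical.

```python
import math
from itertools import product

def merged(mu, events):
    """Sum of the sub-probability measures mu_t over the events of a set T."""
    out = {}
    for t in events:
        for x, mass in mu.get(t, {}).items():
            out[x] = out.get(x, 0.0) + mass
    return out

def base_kernel(m1, m2):
    """Assumed base kernel exp(-d^2) on mass vectors."""
    d2 = sum((m1.get(x, 0.0) - m2.get(x, 0.0)) ** 2 for x in set(m1) | set(m2))
    return math.exp(-d2)

def multires_kernel(node, mu, mu_p, eps):
    """Recursive update K_T = (1 - eps) k_T + eps * prod of K_U over the siblings U."""
    events, siblings = node
    k_here = base_kernel(merged(mu, events), merged(mu_p, events))
    if not siblings:          # terminal set: its branching parameter is 0
        return k_here
    below = math.prod(multires_kernel(u, mu, mu_p, eps) for u in siblings)
    return (1 - eps) * k_here + eps * below

def partitions(node, eps):
    """Yield (prior weight, list of event sets) for every partition of the tree."""
    events, siblings = node
    if not siblings:
        yield 1.0, [events]
        return
    yield 1 - eps, [events]
    for combo in product(*(list(partitions(u, eps)) for u in siblings)):
        weight, sets = eps, []
        for w, s in combo:
            weight *= w
            sets = sets + s
        yield weight, sets

def brute_force_kernel(node, mu, mu_p, eps):
    """Direct evaluation of the weighted sum of resolution specific kernels."""
    return sum(
        w * math.prod(base_kernel(merged(mu, T), merged(mu_p, T)) for T in sets)
        for w, sets in partitions(node, eps)
    )

leaf = lambda t: ((t,), [])
tree = (("a", "b", "c", "d"),
        [(("a", "b"), [leaf("a"), leaf("b")]),
         (("c", "d"), [leaf("c"), leaf("d")])])
mu = {"a": {"x": 0.25}, "b": {"y": 0.25}, "c": {"x": 0.25}, "d": {"y": 0.25}}
mu_p = {"a": {"y": 0.25}, "b": {"x": 0.25}, "c": {"x": 0.25}, "d": {"x": 0.25}}

fast = multires_kernel(tree, mu, mu_p, eps=0.5)
slow = brute_force_kernel(tree, mu, mu_p, eps=0.5)
```

The recursion visits each node of the tree once, whereas the brute-force sum grows with the number of partitions, which is the point of the factorization.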
4 Experiments



We present in this section experiments inspired by the image retrieval task
first considered in [2] and also used in [?], although the images used here are
not exactly the same. The dataset was also extracted from the Corel Stock
database and includes 12 families of labelled images, each class containing 100
color images, each image being coded as 256 × 384 pixels with colors coded in 24
bits (16M colors). The families depict bears, African specialty animals, monkeys,
cougars, fireworks, mountains, office interiors, bonsais, sunsets, clouds, apes
and rocks and gems. The database is randomly split into balanced sets of 900
training images and 300 test images. The task consists in classifying the test
images with the rule learned by training 12 one-vs-all SVMs on the learning
fold. The objects are then classified according to the SVM yielding the highest
score, namely with a "winner-takes-all" strategy. The results presented in this
section are averaged over 4 different random splits. We used the CImg package
to generate histograms and the Spider toolbox for the SVM experiments².
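
The "winner-takes-all" decision rule can be sketched as follows, assuming the decision values of the one-vs-all classifiers are already available as a score matrix. The function name, the class names and the scores below are hypothetical and do not come from the experiments.

```python
def winner_takes_all(scores, class_names):
    """Assign each object to the class whose one-vs-all classifier scores highest.

    scores[i][c] is the decision value of the c-th classifier on object i.
    """
    predictions = []
    for row in scores:
        best = max(range(len(row)), key=lambda c: row[c])
        predictions.append(class_names[best])
    return predictions

# Hypothetical decision values for three test objects and three classes.
class_names = ["bears", "fireworks", "sunsets"]
scores = [
    [0.8, -1.2, -0.3],
    [-0.5, -0.1, -0.9],
    [-1.0, 0.2, 1.4],
]
predicted = winner_takes_all(scores, class_names)

true_labels = ["bears", "fireworks", "fireworks"]
error_rate = sum(p != t for p, t in zip(predicted, true_labels)) / len(true_labels)
```

The error rate computed this way on each test fold is the quantity that is averaged over the random splits.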

We adopted a coarser representation of 9 bits per color for the 98,304 pixels
of each image, rather than the 24 available ones, to reduce the size of the RGB
color space to $8^3 = 512$ from the original set of $256^3 = 16{,}777{,}216$ colors. In
this image retrieval experiment, we used localization as the conditioning index
set, dividing the images into 1, 4, $4^2 = 16$, 9 and $9^2 = 81$ local histograms (in
Figure 2 the image was for instance divided into $4^3 = 64$ windows). To define
the branching process prior, we simply set a uniform value of $\varepsilon = 1/\alpha$ over
the whole grid, a choice motivated by previous experiments conducted in a similar context [?].
Finally, we used kernels described in both [2] and [?] to define the base kernel
$k$. These kernels can be directly applied on sub-probability measures, which is
not the case for all kernels on multinomials, notably the Information Diffusion
Kernel [?]. We report results for two families of kernels, namely the Radial
Basis Function expressed for multinomials and the entropy kernel based on the
Jensen divergence [?], with $h$ denoting the entropy:

$$k_{a,b,\rho}(\theta,\theta') = e^{-\rho \sum_i |\theta_i^a - \theta_i'^a|^b}, \qquad k_h(\theta,\theta') = e^{-h\left(\frac{\theta+\theta'}{2}\right) + \frac{1}{2}\left(h(\theta) + h(\theta')\right)}.$$
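
The color reduction and the two families of base kernels can be sketched as below. This is a hedged illustration only: the kernel forms are assumed to be a generalized RBF, exp(-rho * sum_i |theta_i^a - theta'_i^a|^b), and the exponential of minus the Jensen divergence built on the Shannon entropy, and the quantization is assumed to keep the 3 most significant bits of each channel. All function names are hypothetical.

```python
import math

def quantize_rgb(r, g, b):
    """Map a 24-bit color (8 bits per channel) to one of 8**3 = 512 bins."""
    return ((r >> 5) << 6) | ((g >> 5) << 3) | (b >> 5)

def rbf_kernel(theta, theta_p, a=0.25, b=1.0, rho=0.01):
    """Assumed generalized RBF kernel on histograms given as equal-length lists."""
    return math.exp(-rho * sum(abs(x ** a - y ** a) ** b
                               for x, y in zip(theta, theta_p)))

def entropy(theta):
    """Shannon entropy -sum x log x, with the convention 0 log 0 = 0."""
    return -sum(x * math.log(x) for x in theta if x > 0)

def jensen_kernel(theta, theta_p):
    """Assumed entropy kernel: exp of minus the Jensen divergence."""
    mid = [(x + y) / 2 for x, y in zip(theta, theta_p)]
    return math.exp(-entropy(mid) + 0.5 * (entropy(theta) + entropy(theta_p)))

# All 512 bins are reached when each channel takes one value per 32-level step.
bins = {quantize_rgb(r, g, b)
        for r in range(0, 256, 32)
        for g in range(0, 256, 32)
        for b in range(0, 256, 32)}

same = jensen_kernel([0.5, 0.5], [0.5, 0.5])
disjoint = jensen_kernel([1.0, 0.0], [0.0, 1.0])
```

Both kernels only use sums over bins, so they accept sub-probability vectors whose total mass is below 1 without any renormalization.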

For most kernels not presented here, the multiresolution approach usually
improved the performance in a way similar to the results presented in Table 1.
Finally, we also report that using only the finest resolution available in each
$(\alpha, D)$ setting, that is a branching process prior uniformly set to 1, yielded
better results than the use of the coarsest histogram without however achieving the
same performance as the multiresolution averaging framework, which highlights
the interest of taking both coarse and fine perspectives into account. When
$a = .25$ for instance, this setting produced 16.5% and 16.2% error rates for
$\alpha = 4$ and $D = 1, 2$, and 15.8% for $\alpha = 9$ and $D = 1$.

² http://cimg.sourceforge.net/ and http://www.kyb.tuebingen.mpg.de/bs/people/spider/
Kernel              RBF, b = 1, ρ = .01               JD
                    a = .25    a = .5    a = 1
global histogram    18.5       18.3      18.3       21.4
D = 1, α = 4        15.4       16.4      18.8       17
D = 2, α = 4        13.9       13.5      15.8       15.2
D = 1, α = 9        14.7       14.7      16.6       15
D = 2, α = 9        15.1       15.1      30.5       15.35



Table 1: Results for the Corel image database experiment in terms of error rate, 
with 4 fold cross-validation and 2 different types of tested kernels, the RBF and 
the Jensen Divergence. 